We were given a data set from the National Health and Nutrition Examination Survey (NHANES). According to CDC (2023), the program has run since the 1960s as a series of surveys designed to assess the health and nutritional status of adults and children in the United States. It collects data through in-person interviews and physical examinations of participants.
The survey data are not a simple random sample, however. According to CDC’s National Health and Nutrition Examination Survey: Plan and Operations, 1999–2010 (G et al. 2013), the sampling strategy consists of several stages: 1. Selection of counties as primary sampling units (PSUs). 2. Selection of segments (blocks of households) within PSUs. 3. Selection of specific households within segments. 4. Selection of individuals within a household.
We aim to study the relationship between the weight variable and the other health-related variables in the data.
We began our study with an exploratory analysis of the variables through various tables and charts. We then performed several hypothesis tests on some of the variables. Lastly, we fit a linear regression model with “weight” as the response variable and the other variables, including confounders, as predictors.
Table 1 below gives a data dictionary for the data. Some variables have a noticeable percentage of missing values (for example, marstat at 9.7% and dbp at 9.16%). In Part 2 we used hypothesis tests to decide whether some of these variables could be excluded from the regression analysis in Part 3.
Weight is a continuous variable in our data. A simple way to categorize it is through the body mass index (BMI). The data already contain an obese variable, which dichotomizes BMI at a threshold of 35: a person is labeled obese if the BMI is above 35, and not obese otherwise. We therefore used the obese variable as the categorical counterpart of weight in our project.
| Variables | Type | Example | Number.Unique | MissingPct | Comment |
|---|---|---|---|---|---|
| id | integer | 1, 2, 3 | 6482 | 0% | Identification Code (1 - 6482) |
| gender | factor | Male, Female | 2 | 0% | Gender (1: Male, 2: Female) |
| age | integer | 34, 16, 60 | 65 | 0% | Age (Years) |
| marstat | factor | Married, NA, Widowed | 6 | 9.7% | Marital Status (1: Married, 2: Widowed, 3: Divorced, 4: Separated, 5: Never Married, 6: Living Together) |
| samplewt | numeric | 80100.544, 13953.078, 20090.339 | 2499 | 0% | Statistical Weight (4084.478 - 153810.3) |
| psu | integer | 1, 2 | 2 | 0% | Pseudo-PSU (1, 2) |
| strata | integer | 9, 10, 1 | 15 | 0% | Pseudo-Stratum (1 - 15) |
| tchol | integer | 135, 192, 202 | 251 | 6.09% | Total Cholesterol (mg/dL) |
| hdl | integer | 50, 60, 45 | 112 | 6.09% | HDL-Cholesterol (mg/dL) |
| sysbp | integer | 114, 112, 154 | 61 | 8.53% | Systolic Blood Pressure (mm Hg) |
| dbp | integer | 88, 62, 70 | 40 | 9.16% | Diastolic Blood Pressure (mm Hg) |
| wt | numeric | 87.400002, 72.300003, 116.8 | 957 | 0.57% | Weight (kg) |
| ht | numeric | 164.7, 181.3, 166 | 527 | 0.57% | Standing Height (cm) |
| bmi | numeric | 32.22, 22, 42.39 | 2276 | 0.57% | Body Mass Index (kg/m^2) |
| vigwrk | factor | No, Yes, NA | 2 | 0.02% | Vigorous Work Activity (1: Yes, 2: No) |
| modwrk | factor | No, Yes, NA | 2 | 0.02% | Moderate Work Activity (1: Yes, 2: No) |
| wlkbik | factor | No, Yes, NA | 2 | 0.02% | Walk or Bicycle (1: Yes, 2: No) |
| vigrecexr | factor | No, Yes, NA | 2 | 0.02% | Vigorous Recreational Activities (1: Yes, 2: No) |
| modrecexr | factor | No, Yes, NA | 2 | 0.03% | Moderate Recreational Activities (1: Yes, 2: No) |
| sedmin | integer | 480, 240, 720 | 37 | 1.22% | Minutes of Sedentary Activity per Week (0 - 840) |
| obese | factor | No, Yes, NA | 2 | 0.57% | BMI>35 (1: No, 2: Yes) |
According to CDC’s classification of body weight by BMI, we have: BMI below 18.5 as Underweight, BMI between 18.5 and 24.9 as Healthy, BMI between 25 and 29.9 as Overweight, and BMI of 30 or above as Obesity. Using these categories, we found a slight positive relationship between body weight and the total cholesterol level, but a negative relationship between HDL and body weight. Because total cholesterol consists largely of HDL and LDL (with a smaller triglyceride-related component), this pattern suggests that the obese population tends to have a high level of LDL and a low level of HDL.
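The categorization described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the report: the function names `bmi_from_measurements`, `bmi_category`, and `is_obese` are hypothetical, and the continuous cut-points (below 18.5, below 25, below 30) are an assumption about how to treat values that fall between the rounded ranges such as 24.9 and 25.

```python
def bmi_from_measurements(wt_kg, ht_cm):
    """BMI in kg/m^2 from weight in kg and standing height in cm."""
    return wt_kg / (ht_cm / 100.0) ** 2


def bmi_category(bmi):
    """Map a BMI value to the CDC weight-status category (assumed cut-points)."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Healthy"
    if bmi < 30:
        return "Overweight"
    return "Obesity"


def is_obese(bmi):
    """The data set's own 'obese' flag, which uses the stricter BMI > 35 rule."""
    return bmi > 35


# Example wt/ht pairs taken from the 'Example' column of Table 1.
for wt, ht in [(87.4, 164.7), (72.3, 181.3), (116.8, 166.0)]:
    bmi = bmi_from_measurements(wt, ht)
    print(round(bmi, 2), bmi_category(bmi), is_obese(bmi))
```

Note that the two schemes disagree for BMI values between 30 and 35: such a person falls in the CDC Obesity category but is not flagged by the obese variable.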
According to ATPIII (n.d.), we can also categorize the cholesterol level.
We first test the independence between obesity and marital status. We form the following contingency table:
| Marital Status | Obese: No | Obese: Yes |
|---|---|---|
| Married | 2530 | 474 |
| Widowed | 418 | 86 |
| Divorced | 528 | 112 |
| Separated | 158 | 35 |
| Never Married | 863 | 160 |
| Living Together | 388 | 66 |
Let X be the categorical random variable for Marital Status, with levels \(i=1,...,I\), and Y be the one for Obesity, with levels \(j=1,...,J\). Assume a random sample of n trials and define the count random variable \(N_{ij}:=\sum_{k=1}^n \mathbf{I}_k(X=i, Y=j)\), where \(\mathbf{I}_k\) is the indicator function for the k-th trial. Then the joint counts \([N_{11}, ..., N_{IJ}]\) follow a Multinomial distribution with n trials and probability vector \(\vec{p}=[p_{11}, ..., p_{IJ}]\). Writing \(p_{i+}\) and \(p_{+j}\) for the marginal probabilities of X and Y, our hypothesis test is therefore:
\[\begin{gather*} H_0: p_{ij}= p_{i+} \cdot p_{+j} ~ \text{for all } i,j\\ H_1:p_{ij} \neq p_{i+} \cdot p_{+j} ~ \text{for at least one pair } i,j \end{gather*}\]
The chi-squared test gives a p-value of 0.6894, so there is not enough evidence to reject the null hypothesis. In other words, we cannot conclude that there is a relationship between obesity and marital status.
We repeat the same test for the other variables against obesity. From Table 2 we can reject independence between obesity and each of the wlkbik, vigrecexr and modrecexr variables, but not for vigwrk or modwrk.
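As an illustration of the test above, the following Python sketch recomputes the Pearson chi-squared statistic from the marital status contingency table using only the standard library. It is not the code used for the report; the helper names `chi2_sf` and `chi2_independence` are hypothetical, and the tail probability uses the closed-form expression that holds for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-squared variable with integer df."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    # Odd df: normal tail plus a finite sum with half-integer gamma values.
    terms = sum(half ** (k - 0.5) / math.gamma(k + 0.5)
                for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * terms


def chi2_independence(table):
    """Pearson chi-squared test of independence for an I x J count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # n * p_i+ * p_+j
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, chi2_sf(stat, df)


# Rows: Married, Widowed, Divorced, Separated, Never Married, Living Together.
# Columns: obese No, obese Yes.
counts = [[2530, 474], [418, 86], [528, 112], [158, 35], [863, 160], [388, 66]]
stat, df, p_value = chi2_independence(counts)
print(f"chi2 = {stat:.3f}, df = {df}, p = {p_value:.4f}")
```

In practice a library routine such as `scipy.stats.chi2_contingency` would normally be used instead of hand-rolled helpers. Like the test in the text, this sketch treats the observations as a simple random sample and does not use the survey weights.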
| | vigwrk | modwrk | wlkbik | vigrecexr | modrecexr |
|---|---|---|---|---|---|
| p-value | 0.5695 | 0.3037 | 1.064e-07 | 4.061e-15 | 2.573e-09 |
Based on thresholds established by clinicians, we categorized four variables related to cholesterol and blood pressure levels.
Cholesterol is an essential fat in the body. According to the American Heart Association, we have: tchol of 240 mg/dL or higher OR hdl under 40 mg/dL as the Dangerous cholesterol level; tchol between 200 and 239 mg/dL OR hdl between 40 and 59 mg/dL for males OR between 50 and 59 mg/dL for females as the At Risk level; and tchol under 200 mg/dL AND hdl of 60 mg/dL or higher as the Healthy level.
Blood pressure is typically categorized into stages based on the systolic (top number) and diastolic (bottom number) readings. According to the American Heart Association, we have: systolic below 120 mm Hg AND diastolic below 80 mm Hg as Normal; systolic between 120 and 129 mm Hg AND diastolic below 80 mm Hg as Elevated; systolic between 130 and 139 mm Hg OR diastolic between 80 and 89 mm Hg as Hypertension Stage 1; systolic of 140 mm Hg or higher OR diastolic of 90 mm Hg or higher as Hypertension Stage 2; and systolic above 180 mm Hg OR diastolic above 120 mm Hg as Hypertensive Crisis.
| | gender | tchol | hdl | chol_category |
|---|---|---|---|---|
| 1 | Male | 135 | 50 | At Risk |
| 2 | Male | 192 | 60 | Healthy |
| 3 | Female | 202 | 45 | At Risk |
| 4 | Male | 160 | 45 | At Risk |
| 5 | Female | 259 | 45 | Dangerous |
| 6 | Male | 182 | 75 | Healthy |

| | sysbp | dbp | bp_category |
|---|---|---|---|
| 1 | 114 | 88 | Hypertension Stage 1 |
| 2 | 112 | 62 | Normal |
| 3 | 154 | 70 | Hypertension Stage 2 |
| 4 | 102 | 50 | Normal |
| 5 | 118 | 82 | Hypertension Stage 1 |
| 6 | 142 | 62 | Hypertension Stage 2 |
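The two categorization rules can be sketched as plain Python functions. This is a hedged illustration rather than the code that produced the output above: the function names `chol_category` and `bp_category` are hypothetical, checking the most severe category first is an assumption about how overlapping rules are resolved, and the `"Unclassified"` fallback is an assumption for combinations the stated thresholds do not cover (for example, a female with hdl between 40 and 49 mg/dL and tchol under 200 mg/dL). Missing values are not handled.

```python
def chol_category(gender, tchol, hdl):
    """Cholesterol level category from total cholesterol and HDL (mg/dL)."""
    if tchol >= 240 or hdl < 40:
        return "Dangerous"
    hdl_floor = 40 if gender == "Male" else 50  # lower bound of the at-risk HDL band
    if 200 <= tchol <= 239 or hdl_floor <= hdl <= 59:
        return "At Risk"
    if tchol < 200 and hdl >= 60:
        return "Healthy"
    return "Unclassified"  # assumption: not covered by the stated rules


def bp_category(sysbp, dbp):
    """Blood pressure category from systolic and diastolic readings (mm Hg)."""
    if sysbp > 180 or dbp > 120:
        return "Hypertensive Crisis"
    if sysbp >= 140 or dbp >= 90:
        return "Hypertension Stage 2"
    if sysbp >= 130 or dbp >= 80:
        return "Hypertension Stage 1"
    if sysbp >= 120:  # diastolic is already known to be below 80 here
        return "Elevated"
    return "Normal"


# The six example rows shown in the output above.
chol_rows = [("Male", 135, 50), ("Male", 192, 60), ("Female", 202, 45),
             ("Male", 160, 45), ("Female", 259, 45), ("Male", 182, 75)]
bp_rows = [(114, 88), (112, 62), (154, 70), (102, 50), (118, 82), (142, 62)]

chol_labels = [chol_category(*row) for row in chol_rows]
bp_labels = [bp_category(*row) for row in bp_rows]
print(chol_labels)
print(bp_labels)
```

Under these assumptions the sketch assigns the same categories to the six example rows as the output shown above.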
Most large-scale surveys combine several sampling design techniques, such as stratification and cluster sampling, which enable researchers to obtain accurate estimates while accommodating practical and cost considerations. Central to these designs is the concept of sampling weights, which are pivotal in ensuring unbiased estimation.
A sampling weight is defined as the inverse of the probability that a specific unit is selected into the sample. These weights adjust for design-imposed inequalities in selection probabilities and are used to compute point estimates.
In stratified random sampling, the population U of size N is partitioned into H strata denoted by \(U_1,...,U_h,...,U_H\), and the size of the \(h\)th stratum is denoted by \(N_h\). In stratum \(h\), a random sample \(S_h\) of size \(n_h\) is selected according to a sampling design; here we assume simple random sampling within each stratum for simplicity. Under stratified random sampling, the estimator of the population total can be written as follows (Lohr 2022):
\(\hat{t}_{str}=\sum_{h=1}^H\sum_{j\in S_h}w_{hj}y_{hj}\)
where \(w_{hj}=N_h/n_h\) is the sampling weight of \(y_{hj}\), the \(j\)th observation in the \(h\)th stratum. Note that the probability of selecting the \(j\)th unit in the \(h\)th stratum is \(\pi_{hj}=n_h/N_h\), so the sampling weight is the inverse of this probability. The unbiased estimator of the population mean \(\bar{y}_U\) can likewise be written in terms of the sampling weights (Lohr 2022):
\(\hat{\bar{y}}_{str}=\frac{\sum_{h=1}^H\sum_{j\in S_h}w_{hj}y_{hj}}{\sum_{h=1}^H\sum_{j\in S_h}w_{hj}}\)
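The two stratified estimators above can be illustrated with a small Python sketch. The strata sizes and observed values below are made-up toy numbers, not NHANES data, and the function name `stratified_estimates` is hypothetical. Each stratum is assumed to be sampled by simple random sampling, so every unit in stratum \(h\) carries the weight \(N_h/n_h\).

```python
def stratified_estimates(strata):
    """Weighted estimates of the population total and mean.

    `strata` is a list of dicts with the stratum size "N_h" and the
    sampled values "y" (so n_h = len(y)).
    """
    weighted_sum = 0.0  # sum over h, j of w_hj * y_hj
    weight_sum = 0.0    # sum over h, j of w_hj
    for stratum in strata:
        w = stratum["N_h"] / len(stratum["y"])  # w_hj = N_h / n_h
        for y in stratum["y"]:
            weighted_sum += w * y
            weight_sum += w
    return weighted_sum, weighted_sum / weight_sum


# Toy example (hypothetical numbers): two strata with different weights.
toy_strata = [
    {"N_h": 100, "y": [2.0, 4.0, 6.0, 8.0]},  # n_h = 4, weight 25
    {"N_h": 60, "y": [10.0, 20.0]},           # n_h = 2, weight 30
]
t_hat, y_bar_hat = stratified_estimates(toy_strata)
print(t_hat, y_bar_hat)
```

In this toy case the weights sum to 160, the total population size, which is why dividing the weighted total by the sum of the weights gives an estimate of the population mean.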
Cluster sampling is another complex sampling technique, used in large-scale surveys when the population elements are dispersed and the fieldwork is costly. We sample primary sampling units (psu’s), which are often natural groupings of the population elements. In cluster sampling, \(N\) denotes the number of psu’s in the population, and the \(i\)th psu contains \(M_i\) elements. For the sample of psu’s, \(S\), we denote by \(n\) the number of psu’s in the sample. In two-stage cluster sampling, a sub-sample \(S_i\) of secondary sampling units (ssu’s) is chosen from the \(i\)th sampled psu, \(i=1,...,n\). The sub-sample size is \(m_i\).
In the case of two-stage cluster sampling with equal probabilities, the sampling weight can be expressed as \(w_{ij}=1/\pi_{ij}=\frac{NM_i}{nm_i}\), where \(\pi_{ij}\) is the probability that the \(j\)th ssu in the \(i\)th psu is in the sample. For unequal probabilities, we need the probability that the \(i\)th psu is in the sample, \(\pi_i\), and the probability that the \(j\)th ssu is in the sample given that the \(i\)th psu is in the sample, \(\pi_{j|i}\). Then the sampling weight is given by \(w_{ij}=1/(\pi_i\pi_{j|i})\).
With the sampling weights defined above, the estimator of the population total in cluster sampling can be written in the following form (Lohr 2022):
\(\hat{t}=\sum_{i\in S}\sum_{j\in S_i}w_{ij}y_{ij}\)
and the estimator of the population mean:
\(\hat{\bar{y}}=\frac{\sum_{i\in S}\sum_{j\in S_i}w_{ij}y_{ij}}{\sum_{i\in S}\sum_{j\in S_i}w_{ij}}\)
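A toy Python sketch of the two-stage cluster estimators follows. All numbers are made up for illustration, and the function names `two_stage_weight` and `cluster_estimates` are hypothetical. Equal-probability selection at both stages is assumed, so \(\pi_i=n/N\) and \(\pi_{j|i}=m_i/M_i\), and the general form \(w_{ij}=1/(\pi_i\pi_{j|i})\) reduces to \(NM_i/(nm_i)\).

```python
def two_stage_weight(pi_i, pi_j_given_i):
    """Sampling weight w_ij = 1 / (pi_i * pi_{j|i})."""
    return 1.0 / (pi_i * pi_j_given_i)


def cluster_estimates(N, sampled_psus):
    """Weighted estimates of the population total and mean.

    `N` is the number of psu's in the population. Each sampled psu is a
    dict with its size "M_i" and the sub-sampled values "y" (m_i = len(y)).
    Equal-probability selection at both stages is assumed.
    """
    n = len(sampled_psus)
    weighted_sum = 0.0
    weight_sum = 0.0
    for psu in sampled_psus:
        m_i = len(psu["y"])
        w = two_stage_weight(n / N, m_i / psu["M_i"])  # equals N*M_i / (n*m_i)
        for y in psu["y"]:
            weighted_sum += w * y
            weight_sum += w
    return weighted_sum, weighted_sum / weight_sum


# Toy example (hypothetical numbers): 2 of N = 10 psu's are sampled.
toy_psus = [
    {"M_i": 20, "y": [1.0, 2.0, 3.0, 4.0]},  # m_i = 4, weight 10*20/(2*4) = 25
    {"M_i": 30, "y": [2.0, 4.0, 6.0]},       # m_i = 3, weight 10*30/(2*3) = 50
]
t_hat_cluster, y_bar_cluster = cluster_estimates(10, toy_psus)
print(t_hat_cluster, y_bar_cluster)
```

For unequal-probability designs, the same `two_stage_weight` helper applies once \(\pi_i\) and \(\pi_{j|i}\) are supplied by the design rather than computed from the counts.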
In essence, sampling weights in complex survey designs play a crucial role in ensuring that the survey results are both accurate and generalizable to the broader population.